Exploratory Analysis on Wine Quality by Jayant Sahewal

Citations

In general, I relied on Google, Stack Overflow and the R documentation to overcome the challenges I faced. However, I would like to mention these specific URLs, which I found most helpful.

Data Summary Section

First, I will load the red and white wine datasets, combine them using a new variable, color, and apply basic variable transformations where needed. Finally, I will display a summary of the loaded data.
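The loading and combining step could look roughly like the sketch below. The helper name `combine_wines` and the file names in the comments are assumptions, not the original code, and the small tables are made up so that the example runs on its own.

```r
# Combine the two wine tables, tagging each row with its color.
# `combine_wines` is a hypothetical helper, not the author's original code.
combine_wines <- function(red, white) {
  red$color <- "red"
  white$color <- "white"
  rbind(red, white)  # both tables must share the same column names
}

# With the real data this might be (file names are assumptions):
# red   <- read.csv("wineQualityReds.csv")
# white <- read.csv("wineQualityWhites.csv")

# Tiny made-up tables so the sketch runs on its own
red_toy   <- data.frame(X = 1:2, alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white_toy <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1),
                        quality = c(6L, 6L, 6L))

df_toy <- combine_wines(red_toy, white_toy)
str(df_toy)
table(df_toy$color)
```

Because each table keeps its own X column, the sample number restarts at 1 for each color after combining.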

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "color"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality         color          
##  Min.   : 8.00   Min.   :3.000   Length:6497       
##  1st Qu.: 9.50   1st Qu.:5.000   Class :character  
##  Median :10.30   Median :6.000   Mode  :character  
##  Mean   :10.49   Mean   :5.818                     
##  3rd Qu.:11.30   3rd Qu.:6.000                     
##  Max.   :14.90   Max.   :9.000
## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : chr  "red" "red" "red" "red" ...

Observations from the Summary

The mean residual sugar level is 5.4 g/l, but one very sweet wine has 65.8 g/l (an extreme outlier). Mean free sulfur dioxide is 30.5 ppm; the maximum of 289 ppm is very high given that the 75th percentile is 41 ppm. The pH of the wines ranges from 2.7 to 4.0, with a mean of 3.2, so there are no basic wines in this dataset. Alcohol ranges from 8% for the lightest wine to 14.9% for the strongest. The minimum quality mark is 3, the mean is 5.8, and the highest is 9.
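One way to make "extreme outlier" concrete is to compare each maximum with Tukey's outer fence, Q3 + 3 * IQR. The sketch below uses the quartiles printed in the summary; the choice of this fence as the definition of "extreme" is an assumption, not something stated in the analysis.

```r
# Quartiles and maxima copied from the summary output above
q <- data.frame(
  variable = c("residual.sugar", "free.sulfur.dioxide"),
  q1  = c(1.8, 17),
  q3  = c(8.1, 41),
  max = c(65.8, 289)
)

# Tukey's outer fence: Q3 + 3 * IQR (an assumed definition of "extreme")
q$upper_fence <- q$q3 + 3 * (q$q3 - q$q1)
q$max_is_extreme <- q$max > q$upper_fence
print(q)
```

Under this rule the fences are 27.0 for residual sugar and 113 for free sulfur dioxide, so both maxima fall well beyond them.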

Univariate Plots Section

In this section, I will plot histograms of all the variables by color and show a summary of each to get a general sense of the dataset. To plot the histograms, I found a function that estimates an optimal binwidth. The solution is not perfect, but I compared its output with histograms using manually chosen binwidths a couple of times, and the results were very close. I believe this technique will be helpful for exploring other datasets, where it lets us plot the variables one by one quickly.
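The binwidth function itself is not shown in this report. As an illustration only, the sketch below uses the Freedman-Diaconis rule, a common choice for automatic binwidths; the name `fd_binwidth` and the stand-in data are assumptions.

```r
# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
# `fd_binwidth` is a hypothetical name; the function actually used is not shown.
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1/3)
}

set.seed(1)
alcohol_toy <- rnorm(1000, mean = 10.5, sd = 1.2)  # made-up stand-in data
bw <- fd_binwidth(alcohol_toy)

# Breaks spaced by that binwidth, covering the full range of the data
brks <- seq(min(alcohol_toy), max(alcohol_toy) + bw, by = bw)

# plot = FALSE only computes the bins; set plot = TRUE to draw the histogram
h <- hist(alcohol_toy, breaks = brks, plot = FALSE)
h$counts
```

If the plots are drawn with ggplot2 (an assumption), the same value can be passed as `geom_histogram(binwidth = bw)`.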

Quality of Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000

From the above summary and plot, quality appears approximately normally distributed for both colors, even though the number of samples differs greatly between them. According to the variable descriptions, quality is supposed to range from 1 to 10; however, there are no wines with a quality of 1, 2 or 10.

Level of Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The alcohol level distribution looks skewed. The red wine sample shows the same pattern of alcohol level distribution as the white wines. The most frequent alcohol level is 9.5%, and the mean is 10.5%.

Level of Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900

The fixed acidity distribution looks normal, and both wines follow a somewhat similar pattern. The wines have extreme outliers at 3.8 and 15.9.

Level of Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800

The volatile acidity distribution looks normal for white wine, while it is very spread out for red wine. However, from the histogram it is clear that red wines have more volatile acidity in general than white wines. From the summary, we can see an extreme outlier at 1.58.

Level of Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600

The citric acid distribution has multiple peaks for red wine, and the same goes for white wine. From the summary, we can see there is an extreme outlier at 1.66.

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

The residual sugar distribution looks skewed. The red wine sample shows the same pattern of residual sugar distribution as the white wines. There is a very sweet wine in our sample at 65.8 g/dm^3.

Level of Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

The chlorides distribution looks normal for both white and red wines. However, it appears to be shifted toward higher chloride values.

Level of Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00

Free Sulfur Dioxide distribution looks normal for white wine and skewed for red wine. There is an extreme outlier at 289.

Level of Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0

Total Sulfur Dioxide distribution looks normal for white wine and skewed for red wine which is in accordance with the Free Sulfur Dioxide distribution. Here as well, we can see one outlier at 440.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390

The density distribution looks normal for red wine, while for white wine it is close to normal but slightly skewed. Density ranges between 0.9871 and 1.0390 g/cm^3.

Level of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010

The pH distribution looks normal for both red and white wine. From the histogram, red wine appears to have a higher pH in general, i.e. it is less acidic.

Level of Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000

The sulphate distribution looks normal for both red and white wine samples. Here as well, we can see an extreme outlier at 2.0.

Univariate Analysis

What is the structure of your dataset?

For this project, I combined the red and white datasets offered in the Data Set Options document. After combining them, there were 6497 observations, i.e. wine samples. Variable X is the sample number and is of no significance to this analysis. Each sample has been graded on quality from 1 to 10 (1 being the worst and 10 the best), though in this dataset the grades range from 3 to 9 for both red and white wines. Quality follows a normal distribution.

What is/are the main feature(s) of interest in your dataset?

The main feature I am concerned with is the quality and I expect to investigate the most prominent features with the strongest effect on the quality. Further, I would like to see which variables are connected to each other.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think alcohol and pH will be important because both have a significant effect on the taste of the wine. Before starting this analysis, I would have thought of the age of the wine as another important feature, but surprisingly it is not among the given features. Whether the rest of the variables are important, we shall see as we investigate further.

Did you create any new variables from existing variables in the dataset?

For combining the red and white wine datasets, I created one variable color which tells whether the wine is white or red.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Although the numbers of red (1599) and white (4898) samples differ, their quality followed the same normal distribution. Mostly, the distributions were wider for white wine samples and narrower for red wine samples. This could be because there are fewer red wine samples. In general, red and white wine showed the same type of distribution for the rest of the variables (left skewed, right skewed or normal), but there were a few variables where the distributions differed significantly. Two such examples are volatile acidity and level of chlorides. In both cases, the white wine distribution is left skewed while the red wine distribution is more or less normal. Other than this, I found a very sweet wine with a residual sugar of 65.8 g/dm^3, which is an extreme outlier.

Bivariate Plots Section

Pair plots for white wine

Pair plots for red wine

Correlation Matrix for red wine

Correlation Matrix for white wine

This correlation matrix is a 12x12 grid cut off along the line -x = y, with each square representing the correlation coefficient between the two intersecting variables. Its gradient runs from 1 to -1, colored from dark blue to dark red respectively. These colors fade to white as the correlation approaches zero. We can match the color of a square to its corresponding place on the legend to read off the approximate correlation of the variables in question.

From the correlation matrix and pair plots, we can see there is a strong correlation in the following pairs:

  • Alcohol vs Density (for both red and white)
  • Fixed Acidity vs Density (for red wine)
  • Residual Sugar vs Density (for both red and white wine)
  • Residual Sugar vs Alcohol (for white wine)
  • Chlorides vs Density (for both red and white wine)
  • Chlorides vs Sulphates (for red wine)
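A list like the one above could also be pulled from a correlation matrix programmatically. The sketch below is one way to do it in base R; the helper name `strong_pairs`, the 0.5 cutoff and the toy data are illustrative assumptions, not values used in this analysis.

```r
# List variable pairs whose absolute Pearson correlation meets a cutoff.
# `strong_pairs` and the 0.5 cutoff are illustrative assumptions.
strong_pairs <- function(data, cutoff = 0.5) {
  m <- cor(data)
  # Look only above the diagonal so each pair is reported once
  idx <- which(upper.tri(m) & abs(m) >= cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]  # strongest pairs first
}

# Made-up data: density depends on alcohol, pH is independent noise
set.seed(42)
n <- 200
alcohol <- rnorm(n, 10.5, 1.2)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, 0, 0.0005),
  pH      = rnorm(n, 3.2, 0.15)
)

res <- strong_pairs(toy)
print(res)
```

With the real data, the function would be applied to the numeric columns of each color separately, since the strong pairs differ between red and white wine.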

Now let’s plot these graphs.

From the above plots we can observe the following:

  • As alcohol increases, density decreases, and the correlation is very strong for both wines.
  • For red wine, density increases with fixed acidity. For white wine, the correlation is very weak.
  • As residual sugar increases, density increases, and the correlation is similar for both wines.
  • For white wine, there is a strong negative correlation between alcohol and residual sugar, while for red wine the relationship is very weak.
  • For chlorides and density, there is a positive correlation for both red and white wine. However, it is stronger for white than for red.
  • For red wine, sulphates increase with chlorides. However, for white wine the relationship is very weak.

Although their correlations are not as strong as those of the pairs above, I would like to analyze the effects of alcohol, density, sulphates, volatile acidity and citric acid on quality, since these variables have the higher correlations with quality. So now let’s plot these one by one.

From the above plots we can observe the following:

  • For both red and white wine, there is a strong positive correlation between quality and alcohol.
  • For both red and white wine, there is a negative correlation between quality and density.
  • For both red and white wine, there is a negative correlation between quality and volatile acidity.
  • For both red and white wine, there is a positive correlation between quality and citric acid.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

For white wine, the strongest correlation I found was between density and residual sugar at 0.84, while for red wine it was between pH and fixed acidity at -0.68. For both white and red wine, the variable most strongly correlated with quality was alcohol, at 0.44 and 0.48 respectively.

Multivariate Plots Section

For this section, I will choose a few pairs of variables for which I found a strong correlation and compare them against quality and color.

Citric Acid and Alcohol

In these plots we can see that red wine is spread out fairly evenly, while for white wine the citric acid level is concentrated in the 0.2 - 0.4 range.

pH and Alcohol

pH and Alcohol have quite similar distribution for both red and white wine. However, white wines generally start with a pH of 2.9 while most red wines start around 3.1.

Chlorides and Sulphates

From the plot we can see that sulphates and chlorides are spread out more for white wine than for red wine.

Volatile Acidity and Alcohol

From this plot, we can see there is a strong relationship between alcohol and volatile acidity for both red and white wines.

Model for quality

From the bivariate analysis, it was clear that quality is strongly affected by alcohol, density, volatile acidity and citric acid. However, the correlation matrix showed that density and alcohol are strongly correlated, and so are volatile acidity and citric acid. So, to reduce possible multicollinearity, we should ideally pick one variable from each pair. For my model, I chose alcohol and volatile acidity and fitted the model for red and white wine separately.
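The per-color fitting described above can be sketched as follows. The data here is synthetic, generated with made-up coefficients, so the numbers it prints are not the results of this analysis; only the formula matches the one used for the models in this section.

```r
# Made-up data standing in for the combined wine table
set.seed(7)
n <- 300
toy <- data.frame(
  color            = rep(c("red", "white"), each = n / 2),
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.2)
)
# Synthetic quality with assumed coefficients and some noise
toy$quality <- 3 + 0.3 * toy$alcohol - 1.5 * toy$volatile.acidity +
  rnorm(n, 0, 0.5)

# One linear model per color, using the same two predictors
fits <- lapply(split(toy, toy$color), function(d) {
  lm(quality ~ alcohol + volatile.acidity, data = d)
})

# Coefficients side by side
cf <- sapply(fits, coef)
print(cf)

# Correlation between the two predictors as a quick multicollinearity check
sapply(split(toy, toy$color), function(d) cor(d$alcohol, d$volatile.acidity))
```

Checking the correlation between the chosen predictors is a simple way to confirm that picking one variable from each correlated pair left little overlap in the model.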

Model for red wine

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity, data = subset(df, 
##     color == "red", select = -c(X, color)))
## 
## Coefficients:
##      (Intercept)           alcohol  volatile.acidity  
##           3.0955            0.3138           -1.3836

Model for white wine

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity, data = subset(df, 
##     color == "white", select = -c(X, color)))
## 
## Coefficients:
##      (Intercept)           alcohol  volatile.acidity  
##           3.0173            0.3244           -1.9792
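To see what these coefficients imply, the sketch below plugs a hypothetical wine into both fitted equations. The coefficients are copied from the printouts above; the helper `predict_quality` and the input values (11% alcohol, volatile acidity 0.4) are made up for illustration.

```r
# Coefficients copied from the two model printouts above
coefs <- rbind(
  red   = c(intercept = 3.0955, alcohol = 0.3138, volatile.acidity = -1.3836),
  white = c(intercept = 3.0173, alcohol = 0.3244, volatile.acidity = -1.9792)
)

# Hypothetical helper: evaluate the fitted linear equation for one wine
predict_quality <- function(color, alcohol, volatile.acidity) {
  b <- coefs[color, ]
  unname(b["intercept"] + b["alcohol"] * alcohol +
           b["volatile.acidity"] * volatile.acidity)
}

# Made-up inputs: 11% alcohol, volatile acidity of 0.4
red_pred   <- predict_quality("red", 11, 0.4)
white_pred <- predict_quality("white", 11, 0.4)
round(c(red = red_pred, white = white_pred), 2)
```

For these inputs the equations give about 5.99 for red and 5.79 for white; the gap comes mostly from the larger volatile acidity penalty in the white wine model.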

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The strong relationship between alcohol and volatile acidity for both red and white wines led me to create a linear model for predicting quality.

Were there any interesting or surprising interactions between features?

From the linear model and its coefficients, it was surprising to see that a decrease in volatile acidity and an increase in alcohol content make for a better wine for both red and white wines, even though a number of other factors differ between them. Another thing I noticed was that the model coefficients for red and white wine quality are very similar.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I made a model that predicts the quality of a wine from its volatile acidity and alcohol content. I believe this model could be improved with a larger dataset. The smaller patterns that I had to neglect could become important with larger datasets, and it would be interesting to include those in the model.


Final Plots and Summary

Plot One: Quality of Wine

Description

This is the first plot, a univariate plot, that I am choosing for my final plots section. I chose it because it shows the distribution of quality for red and white wine. The quality with the highest count is 6 for white wine and 5 for red wine. The distribution for both red and white wines looks normal in nature, and most wines in our dataset are rated 5 or 6. According to the description of the quality variable, it is supposed to range between 1 and 10. However, in our sample, I didn’t find any wine rated 1, 2 or 10. The mean quality came out to be 5.636 and 5.878 for red and white wine respectively. For both wines, the median quality was 6.
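As a quick consistency check, the per-color means above can be combined with the sample counts reported earlier (1599 red, 4898 white) to recover the overall mean quality from the data summary. This is a small worked calculation, not part of the original analysis code.

```r
# Sample counts and mean quality per color, as reported in this document
n_by_color    <- c(red = 1599, white = 4898)
mean_by_color <- c(red = 5.636, white = 5.878)

# The overall mean is the count-weighted average of the per-color means
overall <- weighted.mean(mean_by_color, w = n_by_color)
round(overall, 3)
```

The result rounds to 5.818, matching the overall mean quality shown in the summary output.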

Plot Two: Density vs Alcohol

Description

For the second plot, I am choosing a bivariate plot. I chose this plot because density and alcohol showed one of the strongest correlations among the wine parameters, and this strong correlation led me to exclude density from the linear model. For white wine, the correlation between alcohol and density is -0.78, while for red wine it is -0.49. This is also evident from the plot, in which the line of best fit has a negative slope, i.e. density decreases as alcohol increases.

Plot Three: Alcohol vs Volatile Acidity

Description

Since I used alcohol and volatile acidity for my model, it makes sense to use them in the final plot and look at the general trend of quality across volatile acidity and alcohol. Better quality wines tend to have lower volatile acidity and a higher level of alcohol. There is a strong relationship between alcohol and volatile acidity for both red and white wines. For red wine, the linear model coefficients for the intercept, alcohol and volatile.acidity are 3.0955, 0.3138 and -1.3836. For white wine, they are 3.0173, 0.3244 and -1.9792.


Reflection